Method to identify common structures in formatted text documents

ABSTRACT

A computer implemented method, computer program product and data processing system, for identifying common structures shared across a plurality of formatted text documents. The common structure is presented as a sequence of landmarks, each of which has a starting and ending marker to describe the borders of text. The common structure is identified by counting the occurrences of repeating text segments across documents. Frequently co-occurring adjacent segments become candidates for markers of landmarks. In addition, styling information of textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which gives an advantage over the current practice of content concatenation.

This application is a Divisional application of U.S. patent application Ser. No. 12/634,176, filed on Dec. 9, 2009.

DESCRIPTION

Field of the Invention

The present invention relates generally to an improved document processing system and, in particular, to a computer implemented method, document processing system, and computer program product for identifying the common syntactical and semantic structures across a plethora of formatted text documents. More specifically, structural properties of pieces of text from a document collection of similar type are automatically learned, so that syntactic property rules can be applied to identify how information from multiple documents can be merged together into a corpus satisfying the concepts and relationships that have been identified, including the possibility of discovering or re-discovering one or more templates from the collection.

BACKGROUND OF THE INVENTION

Description of the Related Art

While there has been prior work in the area of information extraction from semi-structured content, techniques disclosed in the present invention differ in the method of combining document structures and text styling for an advantage.

Further, the current invention addresses situations where a common document template has been issued and subsequently followed by individual authors, who try to provide semantically consistent text content to the pre-designated segments in the template. In view of these situations, an exemplary objective of the present invention is to better reconstruct the original document template, while still allowing the method to be robust to minor variations, omissions, or additions to the original.

In addition, the current invention discovers when more than one template was used to create a document collection, and identifies what the original templates are likely to be. It then classifies each document into the template it most likely followed. Multiple templates can end up in one collection when poor document management mixes documents originating from different sources. Very often the file names are not sufficiently descriptive to re-separate them. In order to process such mixed collections of documents, the current invention may be applied to separate them first before extracting the textual content within.

Prior art references discovered during preparation of the discussion herein and considered as possibly relevant to the present invention are briefly described below:

U.S. Pat. No. 6,651,058 to Sundaresan, et al. (Neelakantan Sundaresan, Jeonghee Yi) presented a method to extract concepts and relationships in HTML documents, mainly based on text term frequencies without leveraging document structures.

U.S. Pat. No. 5,799,268 to Boguraev (Branimir K. Boguraev) presented a method to automatically create a help database or index of important terms through linguistic analysis. Their method uses some limited syntactic or styling features such as headings to identify key terms in the document. There is no attempt at recovering a document template.

US Patent Application Publication No. 2006/0026203 to Tan, et al. (Ah Hwee Tan, Rajaraman Kanagasabai) focused on identifying key concepts and relationships from documents using linguistic properties such as noun-verb-noun. It also takes as input a domain database, which is not a requirement in the present invention.

U.S. Pat. No. 7,149,347 to Wnek (Janusz Wnek) presented a method to train and classify paper documents scanned with optical character recognition technology. A set of training data is required to enable Wnek's invention.

U.S. Pat. No. 6,604,099 to Chung, et al. (Christina Yip Chung, Neelakantan Sundaresan) presented a method to discover structures from ordered trees extracted out of HTML documents by tracking the position of various keywords in the trees. Their invention is limited by the fact that the set of keywords has to be provided as input by the user and is not automatically learned from the styling hints in the documents. Moreover, the method is not applicable to flat document structure, which cannot be expressed as an ordered tree.

US Patent Application Publication No. 2006/0288275 to Chidlovskii, et al. (Boris Chidlovskii, Jerome Fuselier) presented a method to classify semi-structured documents via ordered trees. They apply a Naïve Bayesian classifier on structural features of ordered trees to extract concepts from semi-structured data. But, the method does not take advantage of text styling information nor is it applicable to flat document structure, which cannot be expressed as an ordered tree.

In contrast to these above-described methods, the present invention presents a different approach based on discovering the segmentation scheme and record scheme attributes so that, for example, an original template or templates can be rediscovered.

SUMMARY OF THE INVENTION

In view of the foregoing, and other, exemplary problems, drawbacks, and disadvantages of the conventional systems, it is an exemplary feature of the present invention to provide a structure (and method) in which a formatted document can be parsed so as to retrieve potential template entries based on one or more characteristics of the formatting used in the document.

It is another exemplary feature of the present invention to provide a method to discover hidden structures in a repository including a plurality of such formatted documents by a technique of clustering or other statistical processing of the characteristics of a plurality of formatted documents being analyzed for potential template entries.

In a first exemplary aspect of the present invention, to achieve the above features, advantages, and objects, described herein is a computerized method (and apparatus and computer product having embodied therein a set of machine-readable instructions) to identify a common structure from a collection of formatted text documents, including creating a two dimensional array to record an occurrence of text segments in the formatted documents, using a processor on a computer; sequentially retrieving documents from the collection of formatted documents; parsing each retrieved document, using the processor, into text segments according to a segmentation scheme and record scheme attributes of a format used in the formatted documents; entering each occurrence of the text segments in the retrieved documents into the two dimensional array; selecting common text segments across a majority of the documents; creating a one dimensional array and recording therein frequencies of adjacent common segment pairs across the documents; selecting high frequency pairs as starting and ending markers of landmarks; and providing, as an output, a sequence of the landmarks as being a common structure of the collection of formatted text documents.

In a second exemplary aspect of the present invention, also described herein is a computerized method (and apparatus and computer product having embodied therein a set of machine-readable instructions) to discover hidden structures in documents stored in a repository or document collection, including retrieving documents from the repository, each retrieved document having one or more previously-identified markers, each marker serving as a basis for a template entry; clustering, as executed by a processor on a computer, the retrieved documents into a plurality of clusters as based on a preset threshold of a number of markers that are shared by the retrieved documents, each cluster representing a potential document template; and selecting from the plurality of clusters, those clusters that exceed a minimal cluster size, wherein the selected clusters are identified as comprising distinct document templates represented by the documents in the repository.

The illustrative embodiments described herein provide a computer implemented method, data processing system, and computer program product for identifying the common syntactical and semantic structures across a plethora of formatted text documents. The syntactical structure comprises a set of landmarks, wherein each landmark is assigned a beginning text marker and an ending text marker based on specific text strings, symbols and optional text styling such as table cell, bold, italic, underline, etc. Text content in between the markers can then be extracted from documents and mapped to the specific landmark. The semantic structure then comprises a set of rules annotated to landmarks, wherein the rules are derived from the formatting of text content. Text content of the same landmark from multiple documents can be merged and summarized by applying these rules.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other exemplary features, aspects, and advantages will be better understood from the following detailed description of exemplary embodiments of the invention with reference to the drawings, in which:

FIGS. 1A and 1B exemplarily illustrate portions of formatted documents 101, 102 that demonstrate the concept of discovering or re-discovering underlying templates;

FIG. 2 shows a block diagram representation of a data processing system 200 in which illustrative embodiments may be implemented;

FIG. 3 exemplarily illustrates visually a high level sequence 300 of a method of the present invention, based upon generation of a co-occurrence matrix;

FIG. 4 exemplarily illustrates a co-occurrence matrix 400 based in part on the first document 100 shown in FIG. 1A;

FIG. 5 exemplarily illustrates a high level summary 500 of a second aspect of the present invention wherein clusters are formed in the co-occurrence matrix of documents in a repository, in order to generate possible templates represented by these documents and to discover hidden structures in the formatted documents;

FIG. 6 depicts an exemplary flow diagram 600 of segmenting text documents and extracting attributes associated with the segments;

FIG. 7 illustrates exemplary steps 700 to construct a two-dimensional array recording the occurrence of text segments;

FIG. 8 illustrates exemplary steps 800 to select text segments to form a candidate set of landmark markers;

FIG. 9 depicts exemplary steps 900 to count the occurrence of marker pairs across documents;

FIG. 10 is an exemplary flow diagram 1000 for the process of selecting top landmark candidates;

FIG. 11 illustrates exemplary steps 1100 to extract formatting and styling attributes from the content of a landmark and to annotate the landmark with predefined rules;

FIG. 12 illustrates an exemplary application of landmark rules 1200 to summarize content from two or more documents;

FIG. 13 illustrates an example 1300 of summarizing the table of contents in two documents into a single table;

FIG. 14 illustrates in more detail an exemplary method 1400 of a second aspect summarized in FIG. 5;

FIG. 15 illustrates an exemplary hardware/information handling system 1500 for incorporating the present invention therein; and

FIG. 16 illustrates a signal bearing storage medium 1600 (e.g., storage medium) for storing steps of a program of a method according to the present invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS OF THE INVENTION

Referring now to the drawings, exemplary embodiments of the method and structures according to the present invention will now be described.

The present invention was initially developed as an automated mechanism to assist in cleansing of documents generated, for example, by a team working on a service engagement, largely conforming to a general, if not vague, previous project-based template. Over time, the original template, as well as the template used by the team for its latest work, has evolved, including evolution during the latest team's efforts. That is, this latest team has itself possibly made various modifications, based on the unique problems encountered during the process of developing its latest service engagement. FIGS. 1A and 1B exemplarily show portions of documents that will be used to illustrate the methods of the present invention.

FIG. 2 shows a pictorial representation of a data processing system 200 in which illustrative embodiments described below may be implemented. The system includes one or more central processing units (CPUs) 202, main memory 204, and one or more storage devices 206. Code or instructions implementing the processes of the illustrative embodiments are executed by the CPU 202 and located temporarily in the main memory 204. The storage devices 206 are used to store the instructions as well as formatted text documents to be processed by the system.

The automated tool 200 of the present invention can work with any number of such documents 101, 110 exemplarily illustrated in FIGS. 1A and 1B, each representing a similar engagement effort, or can be used for cleansing a single document. Moreover, although the present invention was developed and will be discussed herein in the context of service engagement documents and exemplary document formats, one of ordinary skill in the art will readily recognize that it has applications in other areas and formats.

One exemplary goal of the present invention is to discover the project-defined templates represented by any, some, or all of the formatted documents stored in a database, thereby providing an automated process to extract the project-defined templates represented by the database and based on a specified format. This template extraction is currently done manually, with the intent that, for future service engagement efforts, content created for one customer could be reused for other customers in a similar scope of effort.

Thus, in one exemplary embodiment, the present invention is directed to the problem of harvesting textual descriptions from fragments of formatted documents that are largely conforming to a vague project-defined template in order to discover one or more overall project-defined template or templates.

For example, a specific service engagement document might have a template that includes headings such as “process narrative”, “identification”, “description”, “process model”, “regulatory impact”, “organizational change”, “gaps”, etc. The tool of the present invention will automatically parse out a listing of text string fragments from a formatted document as potentially useful to serve as template subject headings (e.g., landmarks) that another service engagement team would use to fill in specific information related to their service engagement. As will be explained in more detail below, the method of the present invention starts by parsing a formatted document to initially discover the markers within the formatted document, based on the types of markers used in that specific formatted document, which will then serve as candidates for discovering landmarks that might serve in a template, including, potentially, landmarks having an associated text field to be recognized and filled in by a user using that template.

As mentioned above, the reason for discovering (or re-discovering) a template represented by documents in such a database is that, at the discretion of project managers and client preferences, new project templates are evolving over time. In the current method, documents resulting from project-specific templates are submitted to a harvesting and cleansing team, which has the task of opening each such document, one at a time, examining the document, and copying it to a common template as a cleansed document.

The present invention provides a research-developed automatic cleansing tool aimed at streamlining, if not completely eliminating, this manual template cleansing process. Manual intervention is only required when the template cannot be reliably identified, which often implies the document collection might not have followed a common structure in the first place.

As exemplarily illustrated by the above exemplary listing of template headings, one of the problems to be solved in the context of the present invention is that of inferring and declaring landmarks (e.g., text segments of interest), based on determining beginning and ending markers for landmarks. A service engagement document might be formatted in a Microsoft Word document saved in XML, having the text strings that might be useful as landmarks, such as headings, paragraphs, lists, tables, lists in tables, etc. Markers can be signaled by a variety of visual cues, including, for example, uppercase font, bold or italic letters, separate lines, etc., and markers can be a mixture of content and formatting styles.

A second exemplary problem is that of determining hidden structures in documents whose landmarks have been deciphered (e.g., reconstruct potential templates represented by the documents under analysis). The hidden structure can be determined by clustering or other statistical processing, as will be described in more detail shortly.

It is further noted that, although a document formatted in Microsoft Word is used for demonstrating the methods of the present invention, the method can clearly be applied to other formats, such as, for example, spreadsheets and presentation slides. The current invention is also not limited to the Microsoft technology and can be more generalized to analyze other structured text formats.

The phrase “formatted text document”, as referred to herein, is defined as a sequence of characters and words that have applied presentational styles to convey semantic meanings for human consumption. For example, as exemplarily demonstrated in FIG. 1A, a Microsoft Office Word document may have the characters and words formatted with numeric headings, bold, italic, underline, tables, bullets, etc. Alternatively, a Microsoft Notepad document may have line returns, extra space or labeling characters to signal formatting. Consistent document formatting, also known as using a document template, is often encouraged and applied in team projects where document exchanges take place among team members. Large software development projects often require design documents following a certain format to ensure completeness and consistency.

Thus, a document can be viewed as a collection of character sequences and objects interspersed with formatting information, as is common in MS Word as represented in WordML XML or Lotus Symphony. In the present invention, the formatting information is used as the starting point to discover template information.

Team-based document creation is widespread in, for example, documents for services engagements and software design documentation. Such documents typically start from mandated templates which reduce document structural variations but cannot prevent them. Such documents are often stored in repositories and supported by key-word based searching. These documents often involve multiple documents for single clients, each client being associated with multiple types of documents, as well as documents from different clients. One problem addressed in the present invention is that of finding hidden structures in such documents and improving activities that consume or produce them.

From such information can then be deduced such aspects as how a team worked to create the documents, the nature of starting a template, how the repository was created from content from different clients and document types, along with possibly improving any or all of the above aspects.

The illustrative embodiments provide automated methods to discover and identify common structures shared among formatted text documents. The technique applied does not require the original document template, since the common structure is inferred from its majority existence in the document collection.

The common structure comprises a sequence of landmarks, each of which has a beginning text marker, an ending text marker and text content between the markers. A text marker is a special sequence of characters or words with associated format in the document collection. A text marker is used to identify positions of text in a document. A beginning text marker sets the beginning position of text content belonging to the landmark. An ending text marker sets the ending position of text content belonging to the landmark. The text content in a landmark does not contain text markers. While a text marker may appear in one or more positions in a document, the pair of a beginning marker and an ending marker uniquely identifies the content of the landmark.

Thus, landmarks are discovered by initially extracting candidates from a formatted document by pre-defining one or more specific text markers used in a specific format of a document being parsed and determining which of the candidates should become landmarks for a template, in a mechanism described shortly, and any associated text content, if any, can then be extracted and mapped thereto.

As an example of obtaining ordered objects from a document under analysis, the first six results from a formatted document undergoing parsing for paragraphs, styles, and tree depths might be (e.g., reference document 100 of FIG. 1A):

1 italic, tablecell, 0000FF, Process
2 italic, tablecell, FF0000, <process>
3 italic, tablecell, 0000FF, Team
4 italic, tablecell, FF0000, <team>
5 italic, tablecell, 0000FF, Owner
6 italic, tablecell, FF0000, <owner>

Note that the above examples are based upon a format from within cells of a table having labels “Process”, “Team”, and “Owner”, along with associated contents “<process>”, “<team>”, and “<owner>”, as indicated by italic font. Thus, the format characteristics of interest in extracting landmarks from this document would be tablecell location 0000FF (color blue) and, possibly, “italic” format.

Some of these table cells are associated with text content, such as “BAR-Budget Analysis and Reporting” being associated with the table cell “Team” and “Mary Lou K.” being text content associated with the table cell “Owner”. Moreover, other sections in the document 100 outside of a table cell, such as “Description” 105 and “Triggers” 106 would also be expected to be discovered by the automated tool as candidate landmarks for a template, so there are multiple formatting details that can be utilized by the tool to discover potential template landmarks within a document being processed.

FIG. 3 shows a high level perspective 300 of a first exemplary embodiment of the present invention. Each document of interest is retrieved 301 and parsed 302, so that, in a third step 303, a sequence of ordered objects can be extracted therefrom, to serve as candidates in a listing that can be selected to become potential landmarks of a template. In a fourth step 304, the ordered objects from the document are placed into a co-occurrence matrix, so that, after all documents of interest have been analyzed 305 for representation of landmarks in the co-occurrence matrix, in a fifth step 306, one or more landmark drafts can be generated from the co-occurrence matrix for proposal to a user as a possible template.

FIG. 4 shows exemplarily a possible co-occurrence matrix 400 for the ordered objects listed above (e.g., from document 100 in FIG. 1A), as these objects might appear in various documents in a repository that are possibly related by a common ancestor template (e.g., Doc 2, . . . Doc N).

FIG. 5 shows visually a high level perspective 500 of a second exemplary aspect of the present invention to be discussed in more detail later, wherein the ordered objects (e.g., the co-occurrence matrix) can then be clustered, in step 501, as a mechanism to analyze content of the documents, in order to derive information for the template creation tool (e.g., discover hidden structures in the documents of interest) to discover or re-discover possible templates underlying the documents, as reported in step 502.

This second aspect is used to group subsets of documents in a collection, where each subset may be following a different original template. This situation can happen frequently in practice since poor document management systems can mix documents originating from different sources together. The first step is thus to attempt to re-separate them. Possible inputs for the automated tool in this aspect include cluster size 503 and number of templates 504 expected in the repository of documents.
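The grouping described in this second aspect can be illustrated with a minimal sketch. The greedy assignment strategy, the function name `cluster_documents`, and the parameter names `shared_threshold` and `minimal_cluster_size` are assumptions made for illustration only; the text above specifies just a threshold on shared markers and a minimal cluster size, and other clustering or statistical techniques could equally be used.

```python
def cluster_documents(doc_markers, shared_threshold, minimal_cluster_size):
    """Group documents that share at least `shared_threshold` markers.

    doc_markers maps a document ID to the set of markers found in it.
    Greedy assignment is an illustrative assumption, not a prescribed method.
    """
    clusters = []  # list of (representative marker set, member document IDs)
    for doc_id, markers in doc_markers.items():
        for representative, members in clusters:
            if len(representative & markers) >= shared_threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((set(markers), [doc_id]))
    # Keep only clusters that exceed the minimal cluster size.
    return [members for _, members in clusters
            if len(members) > minimal_cluster_size]


example_docs = {
    "d1": {"Process", "Team", "Owner"},
    "d2": {"Process", "Team", "Gaps"},
    "d3": {"Scope", "Risks", "Cost"},
    "d4": {"Scope", "Risks", "Plan"},
    "d5": {"Misc"},
}
templates = cluster_documents(example_docs, shared_threshold=2,
                              minimal_cluster_size=1)
```

In this hypothetical collection, each surviving cluster would stand for one candidate template, and the singleton document is set aside.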

Turning now to FIG. 6, a flow diagram 600 illustrates segmenting text documents and extracting attributes associated with the segments. The flow starts in step 602 with the declaration of a text segmentation scheme. The segmentation scheme is dependent on the text document formatting, such as Microsoft Office Word, Microsoft Notepad, Lotus Symphony Documents, etc. The segmentation scheme is an input to the present invention, due to its dependency on specific document formatting.

A segmentation scheme is preferred to define boundaries between text segments in a formatted text document. The boundaries may be paragraphs, empty lines, table cells or other semantically meaningful separators. For example, in Microsoft Office Word documents formatted in the WordML language, the <w:p> tag is a paragraph separator. A segmentation scheme may use <w:p> tags found in a Word document to parse the document text into paragraphs.

Steps 604-610 iterate over text documents in the storage space. A document is first read, in step 604, and then dissected in step 606 according to the declared segmentation scheme. For each segmented text, its scheme attributes are then recorded in step 608. Scheme attributes are defined as presentation formatting instructions for semantic interpretation. For example, italic, bold, bullet, numbered, heading, table and so on may be defined as scheme attributes, which are recorded in association with segmented text. In addition, if the document is hierarchical, such as HTML or XML, the path from the root node of the hierarchy to the current text segment may also be included as a scheme attribute.

If there are no more documents to be read, then, for each document, the segments and their attributes are output in the order of occurrence, in step 612.
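As a minimal sketch of steps 606 and 608, the following splits an XML word-processing document into paragraph segments on <w:p> tags and records two scheme attributes. The namespace URI is the WordprocessingML one used by current Word files and is an assumption here (older WordML files use a different URI); the function name `segment_document` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (WordprocessingML); adjust for the format actually used.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def segment_document(xml_text):
    """Return (text, scheme attributes) per paragraph, in order of occurrence."""
    root = ET.fromstring(xml_text)
    segments = []
    for paragraph in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t")).strip()
        if not text:
            continue  # skip empty paragraphs
        attributes = set()
        for run_props in paragraph.iter(f"{{{W}}}rPr"):
            # Simplification: presence of the element is read as "on".
            if run_props.find("w:b", NS) is not None:
                attributes.add("bold")
            if run_props.find("w:i", NS) is not None:
                attributes.add("italic")
        segments.append((text, frozenset(attributes)))
    return segments


SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Description</w:t></w:r></w:p>
<w:p><w:r><w:t>Some narrative text.</w:t></w:r></w:p>
</w:body></w:document>"""

segments = segment_document(SAMPLE)
```

Other scheme attributes (bullets, headings, table cells, or the path from the root node) could be collected the same way by inspecting further properties.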

The steps 700 to process the output of step 612 are illustrated in FIG. 7. In step 702, the system first creates a two-dimensional array with document ID as the row index and text segment ID as the column index. The assignments of row and column can be interchanged, without loss of generality. This two-dimensional array does not have a fixed size. Rather it expands as new rows and columns are inserted.

Steps 704-710 iterate over each document and its segments. That is, for each document, a new document ID is assigned to index the row in the array. For the document, in step 706 it is checked whether each text segment has already been given an ID. If there is no ID, in step 716 a new column ID is added to the array. The new column will have all the cells, across all the rows, set at zero initially. Then the array cell at <document ID, segment ID> is incremented by one, in step 708. If a text segment has an ID already, step 716 is skipped and the cell is incremented by one directly in step 708. In step 710, the iteration repeats until all the text segments in a document are entered into the array.

If there are more unread documents, in step 712, the array will continue to be populated with counts by iterating over another document. Finally, this two-dimensional array is output for use, in step 714.
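The construction of FIG. 7 can be sketched as follows. This is an illustrative sketch only: the function name `build_occurrence_array` is hypothetical, segments are represented as plain strings, and the array is held as a list of rows that grows as new documents and segment IDs appear.

```python
def build_occurrence_array(documents):
    """Build the document-by-segment count array.

    documents is a list of segment lists, one per document, in order of
    occurrence. Returns (segment_ids, rows), where segment_ids maps a segment
    to its column index and rows[d][c] counts segment c in document d.
    """
    segment_ids = {}
    rows = []
    for segments in documents:
        row = [0] * len(segment_ids)          # new row for this document
        for segment in segments:
            if segment not in segment_ids:    # add a new column (step 716)
                segment_ids[segment] = len(segment_ids)
                for earlier_row in rows:
                    earlier_row.append(0)     # zero across all existing rows
                row.append(0)
            row[segment_ids[segment]] += 1    # increment the cell (step 708)
        rows.append(row)
    return segment_ids, rows


segment_ids, rows = build_occurrence_array([
    ["Process", "Team", "Owner"],
    ["Process", "Owner", "Gaps"],
    ["Process", "Team", "Process"],
])
```

A sparse representation (for example, a dictionary keyed by document and segment ID) would serve equally well for large collections.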

Turning now to FIG. 8, steps are illustrated for choosing the text segments that appear most commonly across all the documents. Taking the array from step 714, the counts by columns are computed, optionally using weighting assigned by a user, as indicated by step 804. By default, the scheme attributes associated with the text segments are equally weighted, as indicated in step 802. For example, text segments formatted with bold characters are treated equally with those segments without.

However, it is known from experience that document templates often tend to emphasize sections of text by special formatting. Such convention may provide advantage in recovering the template if text segments with special formatting are weighted higher in becoming candidates for landmark markers. Users optionally may decide to increase or decrease the weighting factor of scheme attributes associated with text segments (step 804).

In step 806, the counts in a column are summed, with step 808 indicating that the per-column counts are optionally adjusted by their weighting factors.

The adjusted totals are then sorted in descending order, where K columns are selected in step 810 from a user-specified value range. In our experience, columns with high adjusted totals relative to the size of the entire document collection may not be good landmark markers. The rule of thumb is that the total should be less than three times the collection size. Similarly, columns with low adjusted totals are improbable landmark markers. The user may, for example, set the low threshold at half of the collection size.

The high and low watermarks are meant to improve the accuracy of marker identification. Experimental evaluations have suggested the effectiveness of the present invention is not significantly affected by the precise value of the user specified range, since there are other compensating steps to follow.
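The column selection of FIG. 8 might be sketched as below. The function name `select_candidates`, the per-column `weights` list, and the parameter names `low_factor` and `high_factor` are assumptions for illustration; the defaults follow the rule-of-thumb values mentioned above (half of and three times the collection size).

```python
def select_candidates(rows, weights=None, low_factor=0.5, high_factor=3.0):
    """Return column IDs whose weighted totals fall inside the watermarks.

    rows is the document-by-segment count array; weights holds one factor
    per column (all 1.0 by default, i.e. equal weighting).
    """
    n_docs = len(rows)
    n_cols = len(rows[0]) if rows else 0
    if weights is None:
        weights = [1.0] * n_cols
    # Sum each column, then adjust by its weighting factor.
    totals = [weights[c] * sum(row[c] for row in rows) for c in range(n_cols)]
    low, high = low_factor * n_docs, high_factor * n_docs
    kept = [c for c in range(n_cols) if low <= totals[c] < high]
    # Sort the surviving columns by adjusted total, descending.
    return sorted(kept, key=lambda c: -totals[c])


candidates = select_candidates([[1, 1, 1, 0], [1, 0, 1, 1], [2, 1, 0, 0]])
```

With three documents, the thresholds are 1.5 and 9, so the column that appears only once is dropped while the others remain as marker candidates.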

Landmark marker identification is performed over these text segments 812, and FIG. 9 and FIG. 10 illustrate the steps to identify landmarks.

First, in step 902, a one-dimensional array is created, as uniquely indexed by a pair of markers. The array is started empty and new entries will be inserted in the following steps. Revisit the two-dimensional array from step 714 of FIG. 7. In step 904, for every row, scan from the first column to the last column. If a column ID, C2, is in the candidate set, in step 906, create a pair <C1, C2>, where C1 is the column ID of the previously encountered marker candidate. Alternatively, as shown in step 907, if there is no C1, as in the beginning of the document, create a pair <*,C2>, and, similarly, if the end of the row is reached, create a pair <C1,*>.

If the pair <C1,C2> is indexed in the one-dimensional array, increment the indexed cell by one, as shown in steps 908, 908a. If <C1,C2> is not found, insert an index entry <C1,C2> with the value of one, as shown in step 910. As shown in step 912, the iteration goes on for each column until the end of the current row. Steps 906-912 are repeated for each row in the two-dimensional array.
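A minimal sketch of the pair counting in FIG. 9 follows. It scans each row of the two-dimensional array from the first column to the last, as described above. Two points are assumptions of this sketch rather than statements of the text: a candidate column is considered only when its count in the current row is non-zero, and a row with no candidates contributes no pairs. The name `count_marker_pairs` is hypothetical.

```python
from collections import Counter

BOUNDARY = "*"  # stands for the beginning or end of a document


def count_marker_pairs(rows, candidates):
    """Count adjacent candidate-marker pairs <C1, C2> across all rows."""
    candidate_set = set(candidates)
    pair_counts = Counter()  # the one-dimensional array indexed by pairs
    for row in rows:
        previous = BOUNDARY
        for column, count in enumerate(row):
            # Assumption: skip candidates that do not occur in this document.
            if column in candidate_set and count > 0:
                pair_counts[(previous, column)] += 1
                previous = column
        if previous != BOUNDARY:
            pair_counts[(previous, BOUNDARY)] += 1  # end of row reached
    return pair_counts


pair_counts = count_marker_pairs(
    [[1, 1, 1, 0], [1, 0, 1, 1], [2, 1, 0, 0]], candidates=[0, 1, 2]
)
```

A `Counter` handles both the "increment if indexed" and the "insert with value one" cases in a single operation.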

FIG. 10 continues from FIG. 9, demonstrating steps in an exemplary method 1000 for the selection of the landmark candidates. First, in step 1002, the top-L <C1,C2> pairs are selected, based on their count values in descending order. The parameter L is user defined. In practice, in one exemplary embodiment, the text segment pairs <C1,C2> are presented to the human user, who decides whether the proposed landmarks are semantically meaningful and useful to extract the text content. C1 and C2 are the starting and ending text markers, respectively.
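Step 1002 amounts to a sort and a cut-off, as in the short sketch below. The function name `top_landmark_candidates` and the sample counts are hypothetical; the human review of the proposed pairs is outside the scope of the sketch.

```python
def top_landmark_candidates(pair_counts, top_l):
    """Return the top-L <C1, C2> pairs by count, in descending order."""
    ranked = sorted(pair_counts.items(), key=lambda item: -item[1])
    return [pair for pair, _ in ranked[:top_l]]


sample_counts = {("*", "Process"): 3, ("Process", "Team"): 2, ("Team", "*"): 1}
proposed = top_landmark_candidates(sample_counts, top_l=2)
```

Each returned pair would then be shown to the user as a proposed landmark with C1 as its starting marker and C2 as its ending marker.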

Turning now to FIG. 11, as suggested by the entry 1006 into this processing, a landmark not only has markers but also has scheme attributes that are useful to merge and combine the extracted text from multiple documents. For a landmark <C1,C2>, first, in step 1102, the original text in between C1 and C2 is extracted from the documents. It should be noted that this step is different from 606 and 608 of FIG. 6, since text in a landmark typically spans more than one text segment. The presentation formatting and styling information associated with the text is then extracted in step 1104, and the most common format and styles are then mapped to a user-defined set of rules in step 1106. The rules associate formatting with semantically meaningful interpretation of the style. For example, a rule may state the bullet formatting is mapped to an unordered list without duplicates; another rule may state the numbered formatting is mapped to an ordered list without duplicates. These rules are then annotated to the landmark <C1,C2> in step 1108.
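Steps 1104 through 1108 can be sketched as a frequency count followed by a table lookup. The rule table below contains only the two example rules mentioned in the text; the names `RULES` and `annotate_landmark` and the style labels are assumptions for illustration, and a real rule set would be user-defined.

```python
from collections import Counter

# Example rule table, limited to the two rules mentioned in the text.
RULES = {
    "bullet": "unordered list without duplicates",
    "numbered": "ordered list without duplicates",
}


def annotate_landmark(styles_per_document, rules=RULES):
    """Map the most common style of a landmark's content to a rule.

    styles_per_document holds, for each document, the list of styles found
    between the landmark's markers. Returns None if no style has a rule.
    """
    counts = Counter(style for styles in styles_per_document for style in styles)
    for style, _ in counts.most_common():
        if style in rules:
            return rules[style]
    return None


rule = annotate_landmark([["bullet"], ["bullet", "bold"], ["numbered"]])
```

Here the bullet style dominates across the three hypothetical documents, so the landmark is annotated with the unordered-list rule.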

Annotated landmark rules may be used to summarize or combine textual content from two or more documents, as illustrated in the steps of FIG. 12. Previously, textual content from multiple sources was simply concatenated together to preserve its semantic meaning. With the technique described below, the landmark rules can be used to better merge content and highlight similarities and differences.

Steps 1204, 1206, 1208, and 1210 serve as examples of landmark rules to characterize the semantic structures of text content. Two or more texts belonging to the same landmark but coming from multiple documents can be summarized by applying these rules 1200. For example, if a rule states ‘unordered list without duplicates’ 1204, lists from multiple documents can be merged with duplicates removed, as indicated in step 1205. If a rule states ‘numbered list without duplicates’ 1206, list ordering must be preserved and only duplicates with the same number can be removed, as shown in 1207. If a rule states ‘name-value pairs’ 1208, name-value pairs of text are grouped by the name 1209. If a rule states ‘unordered table without duplicates’ 1210, tables of text are read and redundant rows are removed 1211.
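The four example rules may be sketched in Python as follows. This is an illustrative reading, not a definitive implementation: the function names and input encodings are assumptions, and for the numbered list a "duplicate with the same number" is interpreted as an entry whose number and text both match an earlier entry.

```python
def merge_unordered_list(lists):
    """Rule 1204/1205: merge lists, dropping repeated items."""
    seen, merged = set(), []
    for items in lists:
        for item in items:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def merge_numbered_list(lists):
    """Rule 1206/1207: entries are (number, text) tuples.

    Only entries repeating both number and text are removed; the
    numbering order is preserved by a stable sort on the number.
    """
    merged = merge_unordered_list(lists)
    merged.sort(key=lambda entry: entry[0])
    return merged


def merge_name_value_pairs(lists):
    """Rule 1208/1209: group values by their name."""
    grouped = {}
    for pairs in lists:
        for name, value in pairs:
            grouped.setdefault(name, []).append(value)
    return grouped


def merge_unordered_table(tables):
    """Rule 1210/1211: read tables row by row and drop redundant rows."""
    return merge_unordered_list([[tuple(row) for row in t] for t in tables])


# Dispatch table keyed by the rule text annotated to a landmark.
MERGERS = {
    "unordered list without duplicates": merge_unordered_list,
    "numbered list without duplicates": merge_numbered_list,
    "name-value pairs": merge_name_value_pairs,
    "unordered table without duplicates": merge_unordered_table,
}
```

Given the rule annotated to a landmark, a caller could look up MERGERS[rule] and apply it to the landmark's content from each document.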

FIG. 13 illustrates an example 1300 of merging tables by appending additional columns. Document 1 has four columns 1302 and so does Document 2 (e.g., 1304). A merger 1306 of the two documents keeps the first three columns, which are identical in both documents, and creates two new columns, one from the fourth column in Document 1 and one from the fourth column in Document 2. The merged table now has five columns, which in this case better and more concisely represents a summary of the original content.
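A minimal Python sketch of the column-appending merge of FIG. 13 is given below. It assumes that rows are matched on the shared leading columns and that a row missing from one document is padded with None; the function name, the parameter shared and the sample values are hypothetical and are not taken from the figure.

```python
def merge_tables_by_column(table1, table2, shared=3):
    """Merge two tables whose first `shared` columns coincide.

    The remaining columns of each table are appended side by side,
    so two four-column tables yield one five-column table.
    """
    width1 = len(table1[0]) - shared if table1 else 0
    width2 = len(table2[0]) - shared if table2 else 0
    rows1 = {tuple(r[:shared]): list(r[shared:]) for r in table1}
    rows2 = {tuple(r[:shared]): list(r[shared:]) for r in table2}

    merged = []
    # Keep Document 1 order first, then rows found only in Document 2.
    keys = list(rows1) + [k for k in rows2 if k not in rows1]
    for key in keys:
        left = rows1.get(key, [None] * width1)
        right = rows2.get(key, [None] * width2)
        merged.append(list(key) + left + right)
    return merged


if __name__ == "__main__":
    doc1 = [["srv1", "db", "prod", "4GB"], ["srv2", "web", "test", "2GB"]]
    doc2 = [["srv1", "db", "prod", "8GB"], ["srv3", "app", "prod", "1GB"]]
    for row in merge_tables_by_column(doc1, doc2):
        print(row)
```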

The description of the illustrative embodiments above has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

A second aspect of the present invention involves analyzing documents for structural patterns and extracting content, based on the above concepts of locating landmarks in one or more documents. In practice, quite often a document collection may consist of multiple subsets of documents with each subset following a different template. Directly applying the previously described steps in the first aspect of the invention will lead to inaccurate landmarks and their markers.

This aspect of the invention first clusters the segments common to subsets of a document collection. If many documents are associated with a cluster, these documents are more likely to follow the same original template. As part of this approach, statistics of structural patterns and extracted content can also provide feedback on activities related to creating or consuming the documents. This aspect was summarized in FIG. 5.

FIG. 14 shows an exemplary flowchart 1400 of steps of this second aspect that can be used to discover the hidden structures in any number of documents of interest in a database.

In step 1401, for each document with markers, a co-occurrence matrix is created to record document/marker pairs, in the manner previously described. In step 1402, a minimal cluster size that will be accepted as a distinct document template is defined, using as inputs such parameters as intra/inter cluster distance, maximal overlapping, and possibly other user-defined cluster metrics.

In step 1403, the documents are clustered, based on a preset threshold of the number of shared markers. Step 1404 shows that the shared markers can optionally be weighted based on parameters such as popularity, styling, special characters, etc.

In step 1405, the qualities of the clusters are measured and, if desired, the threshold is adjusted, thereby perhaps returning to steps 1402 and 1403. This step 1405 might also be subject to review by the user to provide inputs.

In step 1406, the tool counts and reports on the number of distinct document templates and associated documents.
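Steps 1402-1406 may be sketched, by way of example only, with the Python fragment below. The description does not fix a clustering algorithm; the sketch assumes one simple choice, in which two documents are linked when the (optionally weighted) number of shared markers reaches the threshold, and clusters are the resulting connected groups. The names cluster_documents, doc_markers and weights are hypothetical.

```python
def cluster_documents(doc_markers, threshold, min_cluster_size, weights=None):
    """Cluster documents by shared markers and keep large enough clusters.

    doc_markers: mapping from document name to its set of markers.
    threshold: preset threshold on the weighted shared-marker count.
    min_cluster_size: minimal size accepted as a distinct template.
    weights: optional per-marker weights (step 1404); default weight 1.0.
    """
    weights = weights or {}
    names = list(doc_markers)
    parent = {name: name for name in names}   # union-find forest

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]   # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):                 # step 1403
        for b in names[i + 1:]:
            shared = doc_markers[a] & doc_markers[b]
            score = sum(weights.get(marker, 1.0) for marker in shared)
            if score >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # step 1406: each surviving cluster stands for one document template
    return [sorted(c) for c in clusters.values() if len(c) >= min_cluster_size]
```

The length of the returned list is the number of distinct document templates, and each entry lists the associated documents. Adjusting threshold and min_cluster_size and re-running corresponds to the feedback loop of step 1405.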

Thus, FIG. 14 demonstrates an exemplary method for an automated survey tool that can selectively analyze an entire document collection and is capable of handling either the case wherein there is no background knowledge of the number of templates followed or the case wherein K templates are known to be followed.

In the case where there is no knowledge of the number of templates followed, the tool expects an input of a plurality of tagged documents, where tags will be referred to as markers. Next, the documents are clustered, based on a preset threshold on the number of shared markers, where the shared markers may optionally be weighted on various factors, including popularity, prior knowledge, etc. Next, a minimal cluster size that would be accepted as a distinct document template is set, as a fraction of the total repository or as an absolute number. Finally, the number of distinct document templates is counted and reported, along with associated documents.

In the case where it is known that K templates are followed in the documents under analysis, the initial steps are similar to those described above, but the tool then counts and reports whether the number of distinct document templates was K and returns the associated documents.
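The reporting for the two cases may be sketched as follows, assuming the clusters have already been computed as lists of document names. The function name survey_templates and the report keys are hypothetical.

```python
def survey_templates(clusters, expected_k=None):
    """Report the template count and, if K is known, whether it matches.

    clusters: list of clusters, each a list of document names.
    expected_k: the known number K of templates, or None when there is
                no background knowledge of the number of templates.
    """
    report = {
        "template_count": len(clusters),
        "documents": clusters,
    }
    if expected_k is not None:
        report["matches_expected"] = len(clusters) == expected_k
    return report
```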

As one example related to team organization, the background knowledge is that the documents should follow a single template and are of a single type. Statistics about the markers are bi-modal, pointing to the existence of two templates. The feedback is that a sub-team emerged in the project that created the second template.

In a second example related to template design, where the initial template is available as background knowledge, the extracted landmarks showed more structural regions of useful knowledge, so that the template could be extended with new fields.

The automated template creation tool of the present invention performs two steps. In a first step, for each template, a set of landmarks is created that defines common structural regions containing useful information in the documents. In a second step, for each document, a relevant landmark set is identified and the contents of the landmarks are extracted. The content of a landmark is annotated with that landmark as its metadata. A future user of the template would use this metadata to recognize what specific information is to be filled into the landmark in its application in the template.
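The second step, extracting landmark content and annotating it with the landmark as metadata, may be sketched in Python as below. The sketch assumes plain-text documents, literal string markers, and the wildcard "*" for a document boundary; the function name extract_landmark_contents and the sample text are hypothetical.

```python
def extract_landmark_contents(text, landmarks):
    """Extract the text between each landmark's markers.

    text: the document as a plain string (assumed form).
    landmarks: list of (start_marker, end_marker) pairs; "*" stands
               for the beginning or the end of the document.
    Returns one record per landmark found, with the landmark attached
    as metadata of the extracted content.
    """
    records = []
    for start, end in landmarks:
        if start == "*":
            begin = 0
        else:
            position = text.find(start)
            if position < 0:
                continue                      # starting marker absent
            begin = position + len(start)
        stop = len(text) if end == "*" else text.find(end, begin)
        if stop < 0:
            continue                          # ending marker absent
        records.append({
            "landmark": (start, end),
            "content": text[begin:stop].strip(),
        })
    return records


if __name__ == "__main__":
    sample = "Scope: build portal. Risks: delays. Cost: 10k"
    marks = [("Scope:", "Risks:"), ("Risks:", "Cost:"), ("Cost:", "*")]
    for record in extract_landmark_contents(sample, marks):
        print(record)
```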

The template creation tool has the characteristic that it works when there is no information about the number of templates followed or the number of documents used to derive it. That is, a single document could be used by the template creation tool. The template creation tool also ensures that all possible markers are captured. The template creation tool also permits a user to oversee the process.

Exemplary Hardware Implementation

FIG. 15 illustrates a typical hardware configuration of an information handling/computer system in accordance with the invention and which preferably has at least one processor or central processing unit (CPU) 1511.

The CPUs 1511 are interconnected via a system bus 1512 to a random access memory (RAM) 1514, read-only memory (ROM) 1516, input/output (I/O) adapter 1518 (for connecting peripheral devices such as disk units 1521 and tape drives 1540 to the bus 1512), user interface adapter 1522 (for connecting a keyboard 1524, mouse 1526, speaker 1528, microphone 1532, and/or other user interface device to the bus 1512), a communication adapter 1534 for connecting an information handling system to a data processing network, the Internet, an Intranet, a personal area network (PAN), etc., and a display adapter 1536 for connecting the bus 1512 to a display device 1538 and/or printer 1539 (e.g., a digital printer or the like).

In addition to the hardware/software environment described above, a different aspect of the invention includes a computer-implemented method for performing the above method. As an example, this method may be implemented in the particular environment discussed above.

Such a method may be implemented, for example, by operating a computer, as embodied by a digital data processing apparatus, to execute a sequence of machine-readable instructions. These instructions may reside in various types of signal-bearing media.

Thus, this aspect of the present invention is directed to a programmed product, comprising signal-bearing media tangibly embodying a program of machine-readable instructions executable by a digital data processor incorporating the CPU 1511 and hardware above, to perform the method of the invention.

This signal-bearing media may include, for example, a RAM contained within the CPU 1511, as represented by the fast-access storage, for example. Alternatively, the instructions may be contained in another signal-bearing media, such as a magnetic data storage diskette 1200 (FIG. 12), directly or indirectly accessible by the CPU 1511.

Whether contained in the diskette 1600, the computer/CPU 1511, or elsewhere, the instructions may be stored on a variety of machine-readable data storage media, such as DASD storage (e.g., a conventional “hard drive” or a RAID array), magnetic tape, electronic read-only memory (e.g., ROM, EPROM, or EEPROM), an optical storage device (e.g., CD-ROM, WORM, DVD, digital optical tape, etc.), paper “punch” cards, or other suitable signal-bearing storage media including memory devices in transmission media, whether stored in formats such as digital or analog, and in communication links and wireless devices. In an illustrative embodiment of the invention, the machine-readable instructions may comprise software object code.

The present invention addresses the need to discover/re-discover common template structures that are otherwise hidden in text formatting. The invention is a critical first step to extract, assimilate, analyze and reuse textual content spanning across multiple documents. The self-learning and automation save precious time and deliver accuracy in practice. Most service artifacts, including software design, business consulting and legal proceedings, can be recovered using the methods described above.

While the invention has been described in terms of a single preferred embodiment, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

Further, it is noted that Applicants' intent is to encompass equivalents of all claim elements, even if amended later during prosecution.

Having thus described our invention, what we claim as new and desire to secure by Letters Patent is as follows:
 1. A computerized method to discover hidden structures in documents stored in a repository or document collection, said method comprising: retrieving documents from said repository, each retrieved document having one or more previously-identified markers, each said marker potentially serving as a basis for a template entry; clustering, as executed by a processor on a computer, said retrieved documents into a plurality of clusters as based on a preset threshold of a number of markers that are shared by said retrieved documents, each said cluster representing a potential document template; and selecting from said plurality of clusters, clusters that exceed a minimal cluster size, said selected clusters being output as comprising distinct document templates represented by the documents in said repository.
 2. The method of claim 1, wherein said clusters are selected based on one of: an absolute number; a fraction of the retrieved documents; and a fraction of a total number of documents in said repository.
 3. The method of claim 2, further comprising counting and reporting on the distinct document templates.
 4. The method of claim 2, further comprising preliminarily determining said markers on said documents.
 5. The method of claim 2, wherein weights are assigned to said shared markers used for said clustering.
 6. The method of claim 2, wherein all of said documents in said repository are retrieved for said clustering.
 7. The method of claim wherein only a portion of said documents in said repository are retrieved for said clustering, as representative of said repository.
 8. The method of claim 7, wherein said portion of documents retrieved are selected randomly.
 9. The method of claim 2, further comprising: retrieving one or more additional documents from said repository; for each additional retrieved document, extracting a content from said retrieved document; and using said extracted content to verify one or more of said distinct document templates.
 10. The method of claim 1, as comprising a set of machine readable instructions tangibly embodied in a tangible machine readable storage medium.